Exploiting Wikipedia Knowledge for Conceptual Hierarchical Clustering of Documents

نویسندگان

  • Gerasimos Spanakis
  • Georgios Siolas
  • Andreas Stafylopatis
چکیده

In this paper, we propose a novel method for conceptual hierarchical clustering of documents using knowledge extracted from Wikipedia. The proposed method overcomes the classic bag-of-words models disadvantages through the exploitation of Wikipedia textual content and link structure. A robust and compact document representation is built in real-time using the Wikipedia application programmer’s interface, without the need to store locally any Wikipedia information. The clustering process is hierarchical and extends the idea of frequent items by using Wikipedia article titles for selecting cluster labels that are descriptive and important for the examined corpus. Experiments show that the proposed technique greatly improves over the baseline approach, both in terms of F -measure and entropy on the one hand and computational cost on the other.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Conceptual Hierarchical Clustering of Documents using Wikipedia knowledge

In this paper, we propose a novel method for conceptual hierarchical clustering of documents using knowledge extracted from Wikipedia. A robust and compact document representation is built in real-time using the Wikipedia API. The clustering process is hierarchical and creates cluster labels which are descriptive and important for the examined corpus. Experiments show that the proposed techniqu...

متن کامل

Document Clustering with Explicit Semantic Analysis (ESA)

Document clustering recently became a vital approach as numbers of documents on web and on proprietary repositories are increased in unprecedented manner. The documents that are written in human language generally contain some context and usage of words mainly dependent upon the same context; recently researchers have attempted to enrich document representation via external knowledge base. This...

متن کامل

Automatic Document Topic Identification Using Hierarchical Ontology Extracted from Human Background Knowledge

The rapid growth in the number of documents available to end users from around the world has led to a greatly-increased need for machine understanding of their topics, as well as for automatic grouping of related documents. This constitutes one of the main current challenges in text mining. In this work, a novel technique is proposed, to automatically construct a background knowledge structure ...

متن کامل

خوشه‌بندی اسناد مبتنی بر آنتولوژی و رویکرد فازی

Data mining, also known as knowledge discovery in database, is the process to discover unknown knowledge from a large amount of data. Text mining is to apply data mining techniques to extract knowledge from unstructured text. Text clustering is one of important techniques of text mining, which is the unsupervised classification of similar documents into different groups. The most important step...

متن کامل

Clustering Blogs Using Document Context Similarity and Spectral Graph Partitioning

Semantic-based document clustering has been a challenging problem over the past few years and its execution depends on modeling the underlying content and its similarity metrics. Existing metrics evaluate pair wise text similarity based on text content, which is referred as content similarity. The performances of these measures are based on co-occurrences, and ignore the semantics among words. ...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:
  • Comput. J.

دوره 55  شماره 

صفحات  -

تاریخ انتشار 2012